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We propose an iterative estimating equations procedure for anal- 
ysis of longitudinal data. We show that, under very mild conditions, 
the probability that the procedure converges at an exponential rate 
tends to one as the sample size increases to infinity. Furthermore, 
we show that the limiting estimator is consistent and asymptotically 
efficient, as expected. The method applies to semiparametric regres- 
sion models with unspecified covariances among the observations, fn 
the special case of linear models, the procedure reduces to iterative 
reweighted least squares. Finite sample performance of the procedure 
is studied by simulations, and compared with other methods. A nu- 
merical example from a medical study is considered to illustrate the 
application of the method. 

1. Introduction. Longitudinal data is often encountered in medical re- 
search and economics studies. In the analysis of longitudinal data (e.g., Dig- 
gle, Liang and Zeger [3]), the problem of main interest is often related to 
the estimation of the mean responses which, under a suitable parametric or 
semiparametric model, depend on a vector [3 of unknown parameters. How- 
ever, the mission is complicated by the fact that the responses are correlated 
and the correlations are unknown. 

Suppose that Y is a vector of responses that is associated with a ma- 
trix X of covariates, which may also be random. Suppose that the (condi- 
tional) mean of Y is associated with a vector of parameters, 9. For notational 
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simplicity, write jjL = fi(X,8) = ~Eg(Y\X) and V = Yax(Y\X). Hereafter Var 
or E without the subscript 9 is meant to be taken at the true 8. Con- 
sider the following class of estimating functions Q = {G = A(Y — fj,)}, where 
A = A(X, 9). Suppose that V is known. Then, by Theorem 2.1 of Heyde [7], 
it is easy to show that the optimal estimating function within Q is given 
by G* = fi'V^iY - n), that is, with A* = fi'V' 1 . Therefore, the optimal 
estimating equation is given by ji'V^iY — fj,) = 0. In the case of longitu- 
dinal data, the responses are clustered according to the subjects. Let Y be 
the vector of responses collected from the iih subject, and Xi the matrix 
of covariates associated with Yj. Let fii = E(Y|Xj) = fii(Xi,/3), where (3 is 
a vector of unknown parameters. Then, under the assumption that (Xi, Yj), 
i = 1, . . . ,n, are uncorrelated with known Vi = Var(Yj|Xj), the optimal esti- 
mating equation (for /3) is given by Yh=i fri^T 0% ~ ^i) = 0> which is known 
as the generalized estimating equations (GEE) (e.g., Diggle, Liang and Zeger 
[3])- 

However, the optimal GEE depends on Vi, 1 < i < n, which are usually un- 
known in practice. Therefore, the optimal GEE estimator is not computable. 
The problem of obtaining an (asymptotically) optimal, or efficient, estima- 
tor of P without knowing the V^'s is the main concern of the current paper. 
To motivate our approach, let us first consider a simple example. Suppose 
that the Y's satisfy a (classical) linear model, that is, E(Y) = Xi/3, where 
Xi is a matrix of fixed covariates, Var(Y) = V\, where V\ is an unknown 
fixed covariance matrix, and Y\,...,Y n are independent. If V\ is known, (3 
can be estimated by the following best linear unbiased estimator (BLUE), 
which is the optimal GEE estimator in this special case: 

(n \ — 1 n 

i=l / i=l 

provided that Y% =1 X<V{- 1 X i is nonsingular. On the other hand, if (3 is 
known, V\ can be estimated consistently as 

1 n 

(1.2) Vi = - V(Y - Xi(3)(Yi - Xi/3)'. 

n f— '. 

i=i 

It is clear that there is a cycle, which motivates the following algorithm, 
when neither /3 nor V\ is known. Starting with the identity matrix / for V\, 
use (1.1) with V\ replaced by / to obtain the initial estimator for /3, which is 
known as the ordinary least squares (OLS) estimator; then use (1.2) with /3 
replaced by the OLS estimator to update V\\ then use (1.1) with the new V\ 
to update 0, and so on. The procedure is expected to result in an estimator 
that is, at least, more efficient than the OLS estimator; but can we ask a 
further question, that is, is the procedure going to produce an estimator 
that is asymptotically as efficient as the BLUE? Before this question can 
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be answered, however, another issue needs to be resolved first, that is, does 
the iterative procedure converge? These questions will be answered in the 
sequel. 

The example considered above is called balanced data, in which the ob- 
servations are collected at a common set of times for all the subjects. In 
this case, the procedure (without iterations) is known as robust estimation 
for analysis of longitudinal data. However, as pointed out by Diggle, Liang 
and Zeger [3] , page 77, so far the method has been restricted to the case of 
balanced data. In many practical situations, however, the data is seriously 
unbalanced, that is, the time sets at which the observations are collected 
vary from subject to subject. An extension of the robust estimation method 
to unbalanced longitudinal data will be given. 

In fact, we are going to consider a statistical model for the longitudi- 
nal data that is much more general than the above example, or the case 
of balanced data, and propose a similar iterative procedure that not only 
converges, but converges exponentially. The latter is the main theoretical 
finding of this paper. Furthermore, at convergence the iterative procedure 
produces an estimator of (3 that is asymptotically as efficient as the optimal 
GEE estimator. In Section 2 we propose a semiparametric regression model 
for the longitudinal data. An iterative estimating equations (IEE) procedure 
is proposed in Section 3 under the semiparametric model. The convergence 
property as well as asymptotic behavior of the limiting estimator are studied 
in Section 4. In Section 5 we consider a special case of our general model, 
the case of (classical) linear models. Some simulation studies are carried out 
in Section 6 to investigate empirically the performance of the IEE and its 
comparison with other methods. A numerical example using data from a 
medical study is considered in Section 7. Some further discussion and con- 
cluding remarks are given in Section 8. Proofs and other technical details 
are deferred to Section 9. 

2. A semiparametric regression model. We consider a follow-up study 
conducted over a set of prespecified visit times t\,.. . Suppose that the 
responses are collected from subject i at the visit times tj, j G Jj C J = 
{1, . . . , b}. Let Yi = (Yij)j£j t . Here we allow the visit times to be dependent 
on the subject. Let X^ = (Xiji)i<i< p represent a vector of explanatory vari- 
ables associated with Yij so that the first component of Xij corresponds to 
an intercept (i.e., X {jl = 1). Write X; = (X i j)j & j i = (A^/^gj^i^/^p. Note 
that Xi may include both time dependent and independent covariates so 
that, w.l.o.g., it may be expressed as Xi = (JQi, A^), where the elements of 
Xn do not depend on j (i.e., time) while those of Xj2 do. We assume that 
(Xi,Yi), i = 1, . . . ,n, are independent. Furthermore, it is assumed that 
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where (5 is a p x 1 vector of unknown regression coefficients and g j(v) 
are fixed functions. We use the notation fjLij = E(Yij\Xi) and ^ = (fJ>ij)jeJi 
throughout this paper. Note that fii = E(Yi\Xi). In addition, denote the 
(conditional) covariance matrix of Yi given Xj as 

(2.2) V i = Vaj:(Y i \X i ), 

whose (j, k)th element is v ijk = cov(Yij,Y ik \Xi) = E{(Y^ -fj^j )(Y ik - fj, ik )\Xi}, 
j, k G Jj. Note that the dimension of Vi may depend on i. Let D = {(j, k) :j,k£ 
Ji for some 1 < i < n}. The following assumption is made. 

Assumption Al. For any (j, k) e D, the number of different fife's 
is bounded, that is, for each (j,k) € D, there is a set of numbers Vj k = 
{v(j,k,l), 1 < I < Ljfc}, where Lj/c is bounded, such that E Vjfc for any 
1 < i < n with j, k £ J{. 

Since the number of visit times, b, is assumed fixed, an equivalence to 
Assumption Al follows. 

Lemma 1. Assumption Al holds if and only if the number of different 
Vi s is bounded, that is, there is a set of covariance matrices V = {V(l), . . . , 
V(L)}, where L is bounded, such that Vi € V for any 1 < i < n. 

As will be seen, Assumption Al is essential for consistent estimation of 
the unknown covariances. On the other hand, the assumption is not unrea- 
sonable. We consider some examples. 

Example 1 (Balanced data). Consider the case of balanced data, that is, 
Ji = J, 1 < i < n. Furthermore, suppose that v ij k is unknown, but depends 
only on j and k. Then, one has Vi = V\, an unknown constant covariance 
matrix, 1 < % < n. Thus, Assumption Al is satisfied. 

The following examples show that Assumption Al remains valid in some 
unbalanced cases as well. 

Example 2. Suppose that the observational times are equally spaced. In 
such a case we may assume, without loss of generality, that tj = j. Suppose 
that the responses satisfy 

(2.3) Yij =x' ij (3 + Ui + w ij + eij, 

i = 1, . . . ,n, j G Ji C J, where Xij is a vector of fixed covariates, Uj is a 
subject-specific random effect, Wij corresponds to a serial correlation, and e,j 
represents a measurement error. It is assumed that the lij's are independent 
and distributed as A^(0,cj^), and the e^-'s are independent and distributed 
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as iV(0,o"e). As for the tOj/s, it is assumed that they satisfy the equation of 
the first-order autoregressive, or AR(1), process, Wij = 4>Wij-\ + Zij, where 
<j> is a constant and \(f)\ < 1, and the Z{ j s 3X6 independent with distribution 
iV{0, o\, (1 — (j) 2 )}. Finally, we assume that u, w and e are independent. Under 
this model, we have E(Y^|JQ) = where Xj = (a^j)jeJ»- Furthermore, we 
have Vi = (vij k )j >keJi , where v ijk = <r 2 + <r 2 + a 2 6 jjk , and S j;k = 1 if j = k 
and otherwise (e.g., Anderson [1], page 174). It follows that Vij k does not 
depend on i. We now further specify the Jj's. Suppose that the responses are 
collected over a week such that the data is collected on Monday, Wednesday 
and Friday for 40 percent of the subjects. For the rest of the subjects, the 
data is collected on Tuesday and Thursday. Then the Jj's are either {1, 3, 5} 
or {2, 4}. It follows that D consists of j = 1, . . . , 5, (1, 3), (1, 5), (3, 5), 

(2,4) and the pairs with the components exchanged. Furthermore, the V^'s 
are either or V^ 2 \ where 

i) +CT -(> 2 i) +f7 '(o l)' 
/i i i\ / 1 4> 2 <A (i o o\ 

a 2 u 1 1 1 \+al U 2 1 2 \+al 1 . 
\1 11/ V^) 4 2 1 / \0 1/ 

In practice, the covariance structure of the data may not be known. 
For example, if the vu^'s are known to satisfy a process with covariances 
cov(wij,Wi k ) = a1jp(j,k) with p(-,-) completely unknown, then the covari- 
ance matrices and will be practically unspecified. However, there 
are only two possible covariance matrices. Thus, again by Lemma 1, As- 
sumption Al is satisfied. 

Example 3 (Growth curve). This model is the same as Example 2 ex- 
cept that a term (p + ai)tij is added to the right-hand side of (2.3), where \i is 
an unknown baseline slope, en is a random effect with distribution iV(0, cr 2 ), 
and tij is the time at which response Y\j is collected. Assume that oi, . . . ,a n 
are independent, and the a's are independent with u, w and e. Then we have 
Vij k = o~ 2 + a atijtik + a w4>^~ k ^ + a e^j,k- Note that, unlike Example 2, here 
the expression of Vij k , indeed, depends on i, the subject. However, since 
Uj £ J = {1, . . . ,6}, the number of different values of the product Ujti k is 
bounded. As a result, the number of different Vij k s is bounded, and hence 
Assumption Al is satisfied. 

As in the previous example, the parametric covariance structure may not 
be known in practice, or one may be concerned with model misspecifica- 
tions. Therefore, as a robust approach, one may prefer to use unspecified 
covariances with the understanding that Assumption Al is satisfied. 
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In a way, the models in the previous examples are formulated as the 
classical linear models, in which the covariates considered fixed. We 

now consider a logistic model in which the covariates are considered random. 

Example 4. Suppose that Y^, i = 1, . . . , n, j £ Jj, are binary responses 
(i.e., Yij = or 1). A covariate is associated with the age of the subject. More 
specifically, there are nine age groups: 40-44, . . . , 75-79 and 80 or over. Let 
Xj = j if the age of subject i belongs to the jth group. Suppose that, given 
Xi and an additional (unobservable) subject-specific random effect on, Y^, 
j £ Jj, are (conditionally) independent such that logit{P(Yjj = l|JQ,Oi)} = 
Po + P\Xi + cti, where logit(x) = log{x/(l — x)}, and (3q and (3\ are un- 
known regression coefficients. Furthermore, (Yi,Xi,oii), i = l,...,n, are in- 
dependent, where Yi = (Yi^j^j^ Finally, we assume that cti\Xi ~ N(0,a 2 ). 
It follows that E(Y i:j \Xi) = E{E(yy {X^aJlXi} = E{h((3 + PiXi + a^Xi} = 
E{/i(/?o + 0\Xi + <r£)}, where h(x) = e x /(l + e x ), and the last expectation 
is with respect to £ ~ iV(0, 1). Similarly, we have Vijk = E[{h(f3o + (3±Xi + 
at;)} 2 ~ S 3' k ] - [E{h({3 + fiiXi + at;)}] 2 . If the conditional independence is not 
assumed, neither is any other specified conditional covariance structure, and 
the above expression for may not hold. In such a case, the Vij^s are un- 
specified. However, it is easy to see that for any j, k, there are, at most, 
nine different Vij^s (corresponding to X^ = 1, . . . , 9). Thus, Assumption Al 
is satisfied. 

3. Iterative estimating equations. 

3.1. Estimation of (3 when V{' s are known. Our main interest is to esti- 
mate (3, the vector of regression coefficients. According to the earlier discus- 
sion, if the Vi's are known, (3 may be estimated by the GEE given below, 



where \ii = (fJ-ij)jeJi with fj,ij = gj(Xi, (3), and \ii is the matrix of first deriva- 
tives whose (j, I) element is given by dfiij/dfli, j G Jj, 1 <l<p (e.g., Liang 
and Zeger [12]). Namely, the estimator, (3, is defined as the solution to (3.1). 

3.2. Estimation of V^s when (3 is known. On the other hand, if (3 is 
known, the covariance matrices Vi can be estimated by the method of mo- 
ments (MoM) as follows. Let k, I) denote the set of indexes 1 < i < n such 
that Vijk = v(j,k,l) (see Assumption Al in Section 2). For any (j,k) £ D, 
1 < I < Ljk, define 



(3.2) v(j,k,l)= {Yi 3 -g 3 {X u f3)}{Y lk -g k {Xi,(3)} : 



(3.1) 



Y J tiV i r\Y t -^)=Q 



1 



iei(j,k,i) 
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where n(j,k,l) = \I(j,k,l)\, the cardinality. Then, define V = (vijk)j,keJii 
where v^k = v(J, k,l), if i G fc, I). In this approach, the estimators of the 
covariances are obtained componentwise. Alternatively, the estimators may 
be obtained as matrices. Let 1(1) denote the set of indexes 1 < i < n such 
that Vi = V(l) (see Lemma 1). For any 1 < / < L, define 

(3-3) V ^ = -7n E (Yi-ViWi-M)', 

where n(l) = \I(l)\- Then define Vi = V(l) if i £ 1(1)- The following lemma 
justifies the use of the words "method of moments." Write X = (Xi)i<i< n . 

Lemma 2. Under the assumed model, we have ~E(Vi\X) = Vi = ~E(Vi\X), 
1 < i < n. 

The proof is straightforward and therefore omitted. In some cases, the 
two estimators are identical, as the following lemma shows. 

Lemma 3. Suppose that there is a division J = {1, . . . , b} = J(l) U • • • U 
J(s) such that J(q) n J(r) = 0, if q ^ r and for each 1 < i < n there is 
1 < r < s such that Ji = J(r). Then, we have Vi = Vi, 1 <i <n. 

The proof follows directly from the definitions. Again, we consider some 
examples. 

Example 1 (Continued). Here Ljk = L = 1 and I(j,k, 1) = 1(1) = 
{1, . . . ,n}. So that v ijk = v(j, k, 1) = n~ 1 Y^i=i{Yij-gj(X i ,0)}{Y ik -g k (X i ,f3)}, 
the (j, k) element of V = V= n" 1 E?=i(*i - ia)(Yi - m)'. 

Example 2 (Continued). Consider the special case in which Jj is ei- 
ther {1,3,5} or {2,4}. Let 1(1) = {1 < i < n : Ji = {1, 3, 5}}, 7(2) = {l<i< 
n:Ji = {2,4}} and n(l) = \I(l)\, 1 = 1,2. Then we have 

v(j,k,l) = ^— (Yij-x'^Yik-x'^P), j,k= 1,3,5; 

n[ ~ l > J 4 ={1,3,5} 

v(j,k,2) = ^- J2 (Yij-x'^Yik-x'^P), j,k = 2,A. 
U[Z) -h={2A} 
On the other hand, we have 

v ( 1 ) = ^m E (Yi-XMYi-XiP)', 

n[L) .^={1,3,5} 

V ^) = ^ E (Xi-XiMYi-Xipy, 

n ^> J s ={2,4} 
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where Xi = {x' i j)j^.j i . It follows that Vt = Vi, 1 <i <n. 

The next example shows that, when the two estimators are different, the 
first estimator is likely to be more efficient than the second one. 

Example 5. Consider, again, Example 2, but this time assume that 
every subject has a baseline measure on Monday. Then 50% of the sub- 
jects have follow-up measures on Tuesday, and 50% of the subjects have the 
follow-ups on Wednesday. In other words, Jj is either {1,2} or {1,3}. To be 
more specific, let n = 2m such that subjects 1, . . . ,m have the follow-ups on 
Tuesday, and subjects m + 1, . . . , n have the follow-ups on Wednesday. If we 
assume that the covariances do not depend on i, and write Vjk = cov(Yij,Yik), 
then there are two different covariance matrices, 




On the other hand, it is easy to see that Lj^ = 1 for all possible j, k, 
and 7(1,1,1) = {l,...,n}, 1(1, 2, 1) = 1(2, 2, 1) ={l,...,m} and 1(1,3,1) = 
1(3, 3, 1) = {m + 1, . . . , n}. Thus, we have 

(3-4) v u = -J2^i-^iP) 2 , 

vi2 = m- 1 EZiiYii ~ <MYii - x' i2 (3), v 2 2 = m- 1 £™ 1(^2 - x' l2 (3) 2 , v 13 = 
m^EU^ii " <iP)(Y i3 ~ x' i3 P) and 7)33 = m" 1 E?= m+ i(^3 " < 3 P? ■ 
Similarly, we have 

-1 m 

V(l) = -J2(Yi~ XiP)(Yi - Xtf)', 
i=i 

1 n 

V(2) = - V (Y i -X l p)(Y i -X i py. 
m ■ 1 

i=m+L 

Comparing the estimators, it is seen that v and v are identical except for 
the case j = k = 1. In the latter case, the componentwise estimator of vu, 
vu, is given by (3.4); on the other hand, there are two estimators of v\\ by 
the matrix approach, namely, 

1 m 

v 11 (l) = -Y / (Ya-x' il /3) 2 , 
m . • 
i=i 

1 n 

vn(2)=- x: (Yi-x'^f. 

i=m+l 

However, none of the latter estimators is as efficient as v\\, because each of 
them is based on half of the samples. 
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For such a reason, in the sequel we mainly focus on the elementwise esti- 
mators. 

3.3. Iterative procedure. The main points of the previous subsections 
may be summarized as follows: If the V^'s were known, one could estimate 
(3 by the GEE; on the other hand, if (3 were known, one could estimate the 
ViS by the MoM. It is clear that there is a cycle here, which motivates the 
following iterative procedure. Starting with an initial estimator of /3, use 
(3.2), with (3 replaced by the initial estimator, to obtain the estimators of 
the covariances; then use (3.1) to update the estimator of f3; and repeat the 
process. We call such a procedure iterative estimating equations, or IEE. If 
the procedure converges, the limiting estimator is called the IEE estimator. 

In practice, the initial estimate of (3 may be obtained as the solution to 
(3.1) with Vi= I, the identity matrix (with the suitable dimension). It can 
be shown (see Jiang, Luan and Wang [10]) that in the case of balanced data 
of Example 1, procedure is equivalent to maximum likelihood (ML), if the 
errors £j = Y{ — m are normal. Thus, without the normality assumption, IEE 
may be viewed as a quasi-likelihood method. Although for unbalanced data 
IEE is not equivalent to ML even under normality, the procedure is expected 
to improve the efficiency of the estimator of (3. To see this, note that if the 
initial estimator of (3 is consistent, the estimators of the V^'s by (3.2) are 
expected to be consistent; and, with consistent estimators of the V^'s, the ef- 
ficiency of the next step estimator of (3 should improve (because V{ ~ Vi), and 
so on. Furthermore, IEE is very easy to operate and is, in fact, no stranger 
to practitioners. For example, the procedure has some similarity with the 
so-called backfitting algorithm (e.g., Hastie and Tibshirani [6], Section 4.4). 
But before we study the efficiency and other potentially nice properties of 
this procedure, we need to make sure that it converges. The next question 
is: Given the convergence of IEE, is the limiting estimator asymptotically as 
efficient as the GEE estimator that one would have obtained if the Vi's were 
known? These fundamental questions will be answered in the next section. 

4. Convergence and asymptotic properties. 

4.1. Linear convergence. First we formulate the IEE procedure as fol- 
lows. Let v = (v r )i< r <R denote the vector of different covariances involved 
in the V^'s. We assume that for any v, there is a unique solution to (3.1) for 
p. Denote this solution by (3{v). Also, for any /3, let v{(3) denote the vector v 
whose corresponding components are given by the right-hand side of (3.2). 
We use an example to illustrate. 

Example 5 {Continued). Consider, once again, Example 5. In this case, 
we have v = (vn, v±2, V22, vis, ^33)'. Equation (3.1) has a unique solution 
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given by 

/ n \ — 1 n 

\i=l J i=l 

where X { = (a^JjeJi, J { = {1,2}, V { = V(l), 1 < % < m, and J* = {1,3}, 
14 = V(2), m + 1 < i < n. Furthermore, v{(3) is the vector whose components 
are given by the right-hand sides of (3.4) and the four equations below. 

With such notation, the IEE may be formulated as follows: Take the initial 
v as whose component is 0, if it is a covariance (i.e., j ^ k); otherwise, 
the component is 1. This means that, initially, all the V^'s are taken as the 
identity matrices of suitable dimensions. Then we have f3^ = f3{v^}, v^- 1 ' = 
tf(o) ; /3(2) = / g(i) ? {,(2) =v 0(i)^ and so on In general, we have z^ 2 ™" 1 ) = 

/3 ^(2m-2)| ) £(2m-l) =fi (2m-2). £(2m) = ^(2m-l) ^ {) (2m) = ^(2m-l)| for m = 

1,2, There is an alternative expression, which is more convenient in 

terms of obtaining the convergence rate. Write r = (3 o v and p = v o (3. 
Then we have ft 2m *> = ^ 2m ~ l ) , ft 2m + l ) = r^ 2 ™" 1 )}, t^ 2 ™" 1 ) ={)( 2m - 2 ), 
^(2m) _ ^|-(2m-2)| £ or m = 1^2,.... Theorem 1 below states that, under 
very mild conditions, the IEE converges linearly with probability tending to 
one. Here we adapt a term from numerical analysis: An iterative algorithm 
that results in a sequence x^ m \ m= 1,2, ... , converges linearly to a limit 
x* , if there is < g < 1 such that sup m>1 {|o;( m ) — x*\/g m } < oo (e.g., Press 
et al. [14]). 

Let L\ = maxi<i< n max, e j t Sy with s^- = &iP\p-p\< E1 1 (d/d(3)gj (X i5 /3)|, 
where (5 represents the true parameter vector, E\ is any positive constant, 
and (d/dp)f((3) means (df /dff)\g_s. Similarly, let L2 = maxi<j< n maxj^j. Wij, 

where Wij = sup^_^ <£i \\(d 2 /df3d/3')gj(Xi,$)\\ (|| • || is the spectral norm 
defined in Section 9). Furthermore, define the set V = {v : A m i n (Vi) > So, 
Amax(Vi) < Mo, 1 < i < n}, where A m i n and A max represent the smallest and 
largest eigenvalues, respectively, and 5o and Mq are given positive constants. 
Note that V is a nonrandom set. 

An array of nonnegative definite matrices {^4n,«} is bounded from above 
if ||^4 n .i|l — c f° r some constant c; the array is bounded from below if A~\ 
exists and ||^4^|| < c for some constant c. A sequence of random matrices is 
bounded in probability, denoted by A n = Op(l), if for any e > 0, there are 
M > and N>1 such that P(||Ai|| < M) > 1 - e, if n > N. The sequence 
is bounded away from zero in probability if A' 1 = O p (1). Note that the 
definition also applies to a sequence of random variables, considered as a 
special case of random matrices. 

Recall that p is the dimension of (3 and R the dimension of v. Also re- 
call assumption Assumption Al (above Lemma 1). We make the following 
additional assumptions. 
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Assumption A2. The functions gj(Xi,/3) are twice continuously differ- 
entiable with respect to (3; E(|li| 4 ), I < % < n, are bounded; and L\, L 2 , 
maxi<i<„(||^|| V ll^" 1 !!) are O p (1). 

Assumption A3 (Consistency of GEE estimator). For any given Vi, 
1 < % < n, bounded from above and below, the GEE equation (3.1) has a 
unique solution (3 that is consistent. 

Assumption A4 (Differentiability of GEE solution). For any v, the 
solution to (3.1), (3(v), is continuously differentiable with respect to v, and 

sup w6V ||0/3/0T;||=Op(l). 

Assumption A5. n(j, k, I) — > 00 for any 1 < I < Lj k , (j, k) e D, as n —* 00. 

Theorem 1. Under Assumptions A1-A5, P(IEE converges) -> 1 as 
n —>■ 00. Furthermore, we have P[sup m>1 {|/3^ — (3*\/(pn) m / 2 } < 00] — > 1, 
P[sup m>1 {|£>( m ) - v*\/(Rn) m / 2 } < 00] — >■ 1 as 00 for any < n < (p V 
i?) _1 ; where (f3*,v*) is the (limiting) IEE estimator. 

Remark 1. Because our approach is based on estimating functions 
(EFs), it is therefore necessary to impose some "regularity" conditions to 
make sure that the EFs are valid. Assumption A3 may be regarded as 
such regularity conditions. Note that here we did not specify the techni- 
cal conditions for the existence, uniqueness and consistency of the solution 
to the GEE equation (3.1). However, standard conditions for the existence 
and consistency of the GEE solution have been discussed extensively in 
the literature. See, for example, Liang and Zeger [12] and Heyde [7], Sec- 
tion 12.2. Furthermore, an easy-to-verify sufficient condition can be given 
for the uniqueness of the GEE solution, with the additional assumption 
that the initial estimator, [3^ l \ is consistent. To see this, note that by the 
proof of Theorem 1 given in Section 9, it is seen that the linear conver- 
gent property of the sequences 0^ m ' and (m = 1,2,...) only depends 
on a small neighborhood of the true (3 where the first sequence is con- 
fined, if the initial estimator /?W is consistent. On the other hand, note 
that, for fixed V^'s, the left-hand side of (3.1) is the gradient of a func- 
tion of (3. Namely, write q((3) = Ei=ii Y i ~ - Mi)- Th en, dq/d/3 
is —2 times the left-hand side of (3.1). Therefore, any conditions that guar- 
antee (local) convexity of q(-) will be sufficient for the (local) uniqueness 
of the solution to (3.1) [note that, because of the earlier remark, all one 
needs is that (3.1) has a unique solution locally, i.e., in a small neighbor- 
hood of the true (3\. It is easy to show that d 2 q/dj3d(3' = 2(h - I 2 ) with 
h =E?=i(dlJ.i/df3)V- 1 (dfM i /d(3') and h = T2=lh,i, where the (k,l) ele- 
ment of /2,« is (d 2 n' i /dPkdPi)V~ 1 (Yi — fii). Since I\ is nonnegative definite, 
a condition of (asymptotic) nonsingularity is: 
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(i) liminf^oo A min (n ^i) > 0. 

Furthermore, it is seen that at the true f3, I2 has mean zero. Also, Assump- 
tion A2 and boundedness from below of the Vi's imply that var(/2) = 0(n). 
Hence, we have n _1 /2 —* in Thus, if I2 is uniformly continuous in /?, the 
matrix 8 2 q/ 8(3 8(3' is expected to be (asymptotically) positive definite in a 
small neighborhood of the true (3, which implies asymptotic local convexity 
of q(-). A condition for the (asymptotic) uniform continuity is: 

(i) for any 5 > 0, there is e > such that — < 6, \siki — sn-i\ < S 
for all 1 < i < n and k, I G Jj, if |/3 — /3\ < e. 

Here s^i denotes 8 2 '8(3^8 'f3i evaluated at (3 and s^i the same quantity 
evaluated at (3. It can be shown that conditions (i) and (ii) are, indeed, 
sufficient for the asymptotic uniqueness of the solution to the GEE equation 
(3.1) (see Jiang, Luan and Wang [10]). 

Remark 2. For the most part, Assumption A4 is about differentiability 
of the implicit function j3{v). It simply requires that the (3 estimates not be 
too sensitive to the changes in v. This is quite reasonable because, other- 
wise, the estimates of /3 will be very unstable so that convergence cannot 
be achieved. Sufficient conditions for such differentiability can be found in 
standard texts of advanced calculus. For example, the following result can 
be implied directly from the theorem on page 265 of Courant and John [2]: 
Let 4> denote the left-hand side of (3.1) as a (vector- valued) function of f3 
and v. If equation (3.1) is satisfied at a point (/?',?)')', and the Jacobian of (f> 
with respect to differs from zero at that point, then in the neighborhood 
of that point equation (3.1) can be solved in one and only one way for f3, and 
this solution gives (3 as a continuously differentiable function of v. Note that 
the existence of (f3' ,v')' is a consequence of Assumption A3 (see the remark 
above). Also, as noted above, for the results of Theorem 1 to hold, one only 
needs the assumptions to be valid in a small neighborhood of (3. Finally, the 
nonzero-Jacobian property is only required with probability tending to one. 

Remark 3. It is clear that the restriction r\ < (p V i?) _1 is unnecessary 
[because, for example, (pr]i)~ m / 2 < (prj2)~ m ^ 2 for any 771 > (p V -R) -1 > 772], 
but linear convergence would only make sense when g < 1 (see the definition 
above). 

Remark 4. The proof of Theorem 1 in fact demonstrates that for any 
5 > 0, there are positive constants M\, M2 and integer that depend only 
on 5 such that, for all n > N, we have Pfsup^-tl/^" 1 ) - /3*|/(pr ? ) m / 2 } < 
Mi] > 1 - 5, P[su Pm>1 {|^ m ) - v*\/{Rrj) m l 2 } < M 2 ] >l-6. 
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4.2. Asymptotic behavior of the IEE estimator. In the discussion in Sec- 
tion 3.3 we conjectured that the (limiting) IEE estimator is an efficient esti- 
mator. In this subsection we show that this conjecture is indeed true. Since 
efficiency is usually defined from an asymptotic point of view (e.g., Lehmann 
[11]), we need to study asymptotic properties of the IEE estimator. The first 
result is regarding its consistency. 

Theorem 2. Under the assumptions of Theorem 1, ((3*,v*) is consis- 
tent. 



To establish the asymptotic efficiency, we need to strengthen Assumptions 
A2 and A5. Define the following: Z^o = uiaxi<j< n max^gj^ \\d 2 fj,ij/df3d(3'\\, 
L 3 = maxi<j< n maxjej. (%, where 



dij = max sup 

i<a,6 )C < P| ^_^ | < ei 



d 3 

9j(Xi,P) 



dp a df3 b op c 



Assumption A2'. Same as Assumption A2 except that gj(Xi,f3) are 
three-times continuously differentiable with respect to (3, and that L2 = 
Op(l) is replaced by L2,o V L3 = Op(l). 

Assumption A5'. There is a positive integer 7 such that n/{n(j, k, Z)} 7 — 
for any 1 < I < Ljk, (j, A;) £ D, as n — > 00. 

We also need the following additional assumption. 

Assumption A6. n~ 1 Ya=i f 1 '^ -1 ^ is bounded away from zero in prob- 
ability. 

Also, let $ be the solution to (3.1) with the true Vi's. Note that (3 is 
efficient, or optimal in the sense discussed in Section 1, but not computable 
unless the true Vi's are known. 

Theorem 3. Under Assumptions Al, A2', A3, A4, A5' and A6, we have 
y/n((3* — (3) — > in probability. Therefore, asymptotically, [3* is as efficient 
as (3. 



The proofs of Theorem 2 and Theorem 3 follow, for the most part, the 
standard arguments for asymptotic properties of Z-estimators (e.g., van der 
Vaart and Wellner [16], Section 3.3) and optimality of estimating equations 
(Godambe [4]), and therefore are omitted. Interested readers are referred to 
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a technical report (Jiang, Luan and Wang [10]). The proof of Theorem 3 
also reveals the asymptotic expansion 

(4-1) /3*-/?=(]T^V! E/^-^-^ + ^T, 



n 



where op(l) represents a term that converges to zero (vector) in probability. 
By Theorem 3, (4.1) also holds with j3* replaced by 0, even through the 
latter is typically not computable. In the next section, we will be looking 
at a case with an "exact" version of (4.1), that is, the equation without the 
term op(l)/s/n, for 0. 

Suppose that the initial estimator of (3 is consistent. Then, under some 
regularity conditions, the next-step estimator of v is consistent; furthermore, 
the next-step estimator of /?, say, /3® [according to the notation below Ex- 
ample 5 (continued) in Section 4.1], is not only consistent but also asymptot- 
ically as efficient as (3. In other words, one has obtained an efficient estimator 
of (3 after one iteration of the IEE. Although both (3* and (3^ are efficient 
estimators, there are reasons that (3* is more desirable in some situations. 
For example, in the case of balanced data of Example 1 with normality, (3* 
is the same as the maximum likelihood estimator (MLE), while (3^ is not. 
In Section 6.2 we show by a simple simulated example that IEE may result 
in substantial improvement over in a small sample situation. Although, 
in general, (3* may not be the MLE (especially if normality fails), there is a 
tendency of seeking for continuing improvement, and therefore not to stop 
after one iteration. It should be pointed out that the continuing improve- 
ment may not be as significant as in the early stage when the procedure is 
close to convergence. Nevertheless, because of the fast convergence (Theo- 
rem 1; also see Sections 6 and 7), one may not need to run the IEE for a 
lot of steps anyway in order to achieve convergence. This, combined with 
the fact that one has an asymptotically efficient estimator even after one 
iteration, would give a practitioner double confidence that he/she does not 
have to run the IEE for many steps in order to use it. 

5. The special case of linear models. One special case of the semipara- 
metric regression model is the following linear model for longitudinal data. 
Let Xi be a matrix of fixed covariates associated with the ith subject. Sup- 
pose that Yi, . . . ,Y n are independent such that E(y,) = Xi(3 and Var(Yj) = 
Vi, where Vi is an unknown fixed covariance matrix, 1 < i < n. Under this 
model, the GEE (3.1) reduces to E?=i X i V i~ l ( Y i ~ x iP) = °> which has an 
explicit solution 

/ n \ — 1 n 

(5.i) 0= l^xlv^Xi) E^'vr 1 ^, 

\i=l / i=l 
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provided that the matrix Ya=i X'jV^Xi is nonsingular. Note that (5.1) 
is the optimal weighted least squares (WLS) estimator, provided that the 
ViS are known. Here a WLS estimator is defined as the solution to the 
minimization problem ming{(Y — X/3)'W(Y — X/3)}, where Y = (li)i<j< n , 
X = pQ)i<j<„ and W is a weighting matrix. Thus, (5.1) corresponds to 
WLS with W = V~ l = diag(T^~ , 1 < i < n). This estimator is also known 
as the best linear unbiased estimator, or BLUE, denoted by /3blue- 

The IEE developed in previous sections can now be expressed more ex- 
plicitly. Namely, if the VJ's are known, one can calculate the BLUE by (5.1). 
On the other hand, if (5 is known, the Vi's can be estimated by MoM as 
follows [see (3.2)]: 

(5.2) v(j, k, I) = —rr-j—r: £ (Y i3 - X'^)^ - X> k [3), 

where X'^ is the jth row of When both /3 and the V{'s are unknown, one 
iterates between (5.1) and (5.2), starting with the OLS estimator. The latter 
is (5.1) with V% replaced by the identity matrix with the same dimension as 
Vi. The procedure is also called iterative reweighted least squares, or IRLS. 

It is clear that IRLS is very easy to operate and is, in fact, no stranger to 
practitioners. However, even in this special case, very little is known about 
the convergence property. A procedure with some similarities has been used 
in robust linear regression (Huber [8]), and its convergence property has been 
studied. For example, Wolke and Schwetlick [18] considered a similar itera- 
tive procedure in solving robust regression problems involving an additional 
scale parameter. However, longitudinal data is certainly more complicated. 

The results of Section 4 now have their versions in this special case. 

Corollary 1. Suppose that ||T^ -1 ||, \\Xi\\ and E(|Y;| 4 ), 1 < i < n, are 
bounded, Assumption A5 holds and liminfn^oo A m i n (n _1 Ya=i X^Xi) > 0. 
Then we have P (IRLS converges) -» 1, Pfsup^^l/?^ - P*\/(pr]) m / 2 } < 
oo] — ► 1, and P[sup m>1 {|£>( m ) — v*\/(Rr]) m / 2 } < oo] — > 1 as n — > oo for any 
0<r]< (pVi?)" 1 , where (j3*,v*) is the (limiting) IRLS estimator. 

Corollary 2. Under the conditions of Corollary 1, (J3*,v*) is consis- 
tent. 

Corollary 3. Under the same conditions of Corollary 1 except that As- 
sumption A5 is strengthened to Assumption A5', we have y / n(/3* — Pblve) 

in probability. Therefore, asymptotically, (3* is as efficient as the BL UE. 

Note. The condition liminfn^oo A m i n (n~ 1 Ya=i X'iXi) > in Corollary 

1 needs some interpretation. Note that the term inside (• • •) is n~ 1 X'X, 
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where X = (Xj)i<j< n . Recall that the covariance matrix of the OLS estima- 
tor /3 ls is (X'Xy 1 {X'VX)(X'Xy\ where V = diag(VS, 1 < i < n). If the 
ViS are bounded from above and below, then we have Var(/3oLs) ~ (X'X)' 1 . 
Hence the condition is equivalent to the covariance matrix of y/n(/3oLS — P) 
being bounded. 

The proofs follow by verification of the assumptions of the theorems in 
Section 4. In the next section we study the empirical performance of IRLS. 

6. Simulations. 

6.1. A simulation regarding Example 2. In this subsection, we consider 
Example 2 in Section 2 (continued in Section 3) with simulated datasets. 
More specifically, we have = /?o + PiXij, where the generated 
from a N(0, 1) distribution and then fixed throughout the simulation. We 
let n = 100. The Jj's are specified as in Example 2, that is, Jj = {1,3,5} or 
{2,4}. 

One potential advantage of IRLS is that it requires neither normality nor 
a parametric covariance model. In this simulation we consider three different 
scenarios. The first is when both normality and the parametric covariance 
model mentioned in Example 2 hold. The second is when normality fails 
but the parametric covariance model is still correct. In this scenario the 
true distribution of the random effects is centralized exponential [i.e., the 
distribution of <7 U (£ — 1), where £ ~ Exponential 1)], while the distribution of 
w and e remains the same as in Scenario 1. The third is when both normality 
and the parametric covariance model no longer hold. In this scenario, the 
true distribution of u and w remains the same as in Scenario 2, while the 
true process of w is MA(1) instead of AR(1) with normal noise z. In each 
scenario, we consider two cases of parameter values: (1) a 2 = 1, = 9, 
a\ = 1 and (f> = 0.9; (2) a 2 u = 9, a 2 w = 25, a 2 = 1 and cj> = 0.99. Note that the 
second case corresponds to stronger within-subject correlations than the 
first case. In all cases, the true values for (3 are flo = 0.5, f3\ = 1.0. These 
scenarios/cases will be denoted by 1.1, 1.2, and so forth. 

We first study the convergence property of IRLS. The convergence crite- 
rion is max{|/3( m) 1^ \}+^ meD {\v^ -v^ \] < 

10 -4 . Note that here d = \D\ = 9. In each case, we recorded the number of 
steps it took to converge. The results are summarized in Table 1. 

Next, we compare the performance of IRLS with OLS and MLE, where 
MLE is that under normality and the parametric covariance model described 
in Example 2. Note that, under Scenarios 2 and 3, the MLE is not the true 
MLE. In Tables 2 and 3, we report the simulated means and covariance 
matrices as well as the true covariance matrix of the BLUE, even though 
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the latter is not computable in practice. The covariance matrix of the BLUE 
serves as a lower bound for the covariance matrix of a WLS estimator. It 
should be pointed out that the IRLS estimator is not a WLS estimator 
and neither is the MLE, because both are nonlinear in Y. Nevertheless, 
we expect, by Corollary 3, the covariance matrix of the IRLS estimator to 
be close to that of the BLUE. The results for OLS and IRLS are based 
on 1,000 simulations. The results for MLE are a bit complicated. Although 
1,000 simulations were run, not all resulted in convergence. Therefore, the 
results for MLE are based on the runs which converged. More specifically, for 
the Cases 1.1, 1.2, 2.1, 2.2, 3.1 and 3.2, the number of runs which converged 
for MLE were 798, 812, 794, 805, 749 and 724. (So, in the worst case, about 
28% of the runs failed to converge.) In Table 2, /3ols> Arls an d $mle denote 
the simulated means of OLS, IRLS estimators and MLE. Similarly, in Table 
3 UoLSi ^irls and Vmle are the simulated covariance matrices and Vblue 
is the true covariance matrix of the BLUE. 

Summary of results. In most cases the IRLS algorithm converged in four 
to six steps. As for comparison of estimators, the main difference is in the 



Table 1 
Number of steps to converge 



Case 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


1.1 


0.0 


4.2 


34.1 


47.8 


11.6 


2.1 


0.2 


0.0 


0.0 


0.0 


1.2 


0.0 


0.6 


19.4 


45.9 


25.5 


7.1 


1.0 


0.3 


0.1 


0.1 


2.1 


0.1 


3.9 


42.1 


38.9 


12.5 


2.1 


0.2 


0.1 


0.1 


0.0 


2.2 


0.0 


1.0 


21.5 


45.7 


23.3 


6.8 


1.3 


0.4 


0.0 


0.0 


3.1 


0.0 


5.8 


39.9 


39.3 


12.8 


2.0 


0.2 


0.0 


0.0 


0.0 


3.2 


0.0 


1.4 


25.2 


43.1 


22.9 


5.9 


1.1 


0.3 


0.1 


0.0 



Numbers in the first row represent the steps; in each case, the numbers are percentages 
of times (out of a total of 1,000 simulations) that the IRLS converged after certain steps. 
Here Case 1.1 represents Scenario 1, Case 1, and so forth. 



Table 2 
Simulated mean vectors 



Case 




Scenario 1 






Scenario 2 






Scenario 3 




^OLS 


/^IRLS 


^MLE 




/^IRLS 


i^MLE 


/^OLS 






1 


0.50 


0.50 


0.50 


0.50 


0.49 


0.50 


0.51 


0.51 


0.52 




0.99 


1.00 


1.00 


1.00 


1.00 


1.00 


1.01 


1.01 


1.01 


2 


0.50 


0.51 


0.50 


0.51 


0.51 


0.53 


0.50 


0.47 


0.56 




1.00 


1.00 


1.00 


1.00 


1.01 


1.00 


1.02 


1.02 


1.01 



In each case, the top row gives the simulated means for f3o; the bottom row gives those 
for 
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variances of the estimators of (5\. Both IRLS and MLE significantly out- 
perform OLS. Furthermore, the performance of IRLS is very close to that 
of BLUE, although the latter is not computable in practice. Note that, 
when there is no correlation, OLS is the same as BLUE. As <r„, and <f> 
increase, the within-subject correlation increases and, as a result, the differ- 
ence between OLS and IRLS becomes larger. In the first two scenarios, MLE 
appeared to slightly outperform IRLS (and BLUE). This suggests that nor- 
mality is not very important for the efficiency of the MLE for estimating (5. 
However, the situation changes in Scenario 3 with IRLS (and BLUE) slightly 
outperforming MLE. In fact, the largest difference in favor of MLE is 18% 
less than IRLS in Case 2.2, while the largest difference in favor of IRLS is 
14% less than MLE in Case 3.2. This suggests that the efficiency of MLE 
is likely to be undermined by misspecification of the covariance structure. 
It should be pointed out that, as mentioned above, the results for MLE are 
based on only runs which converged, the numbers of which are somewhere 
between 20 to 30 percent less. We expect the actual performance of MLE to 
be somewhat worse than reported here if all runs are reported, because the 
discarded runs may correspond to some "bad cases." Despite such concerns, 
the simulation results are consistent with our theoretical findings. 

6.2. A comparison with the one-step estimator. In this subsection, we 
compare the small-sample performance of IEE with the one-step estimator 
/?( 3 ) discussed in the second paragraph following Theorem 3. We consider the 
following simple example: Y{j = /3q + PiXi + , i = 1, . . . , n, j = 1,2, where 



Table 3 
Simulated covariance matrices 



Case 


V 


OLS 


V 


IRLS 


V 


MLE 


Vblue 




1.1 


9.80 


0.11 


10.36 


0.36 


10.00 


0.48 


9.30 


0.11 




0.11 


5.51 


0.36 


2.09 


0.48 


1.87 


0.11 


2.16 


1.2 


36.41 


0.61 


36.41 


-0.10 


35.57 


-0.08 


34.10 


0.06 




0.61 


17.41 


-0.10 


1.30 


-0.08 


1.23 


0.06 


1.29 


2.1 


9.87 


0.14 


9.94 


0.07 


9.19 


0.02 


9.30 


0.11 




0.14 


5.35 


0.07 


2.25 


0.02 


2.08 


0.11 


2.16 


2.2 


38.74 


1.78 


38.48 


-0.01 


34.54 


0.15 


34.10 


0.06 




1.78 


15.86 


-0.01 


1.31 


0.15 


1.11 


0.06 


1.29 


3.1 


7.14 


0.44 


7.25 


0.41 


6.77 


0.45 


7.15 


0.23 




0.44 


5.34 


0.41 


4.01 


0.45 


4.13 


0.23 


3.74 


3.2 


25.48 


0.26 


26.22 


0.22 


24.73 


0.46 


25.40 


0.63 




0.26 


17.55 


0.22 


10.80 


0.46 


12.35 


0.63 


9.61 



In each case, the (2 x 2) simulated covariance matrices corresponding to OLS, IRLS and 
MLE and the true covariance matrix of the BLUE are presented. Here Case 1.1 represents 
Scenario 1, Case 1, and so forth. 
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the Xj's are known covariates; (3q, j3\ are unknown regression coefficients; 
the Eij 's are independent with e^- ~ iV(0, cr|), j = 1, 2. 

Here we let n = 10. The Xj's are generated from a Uniform[0, 1] distri- 
bution, and then fixed throughout the simulation. The true parameters are 
chosen as (3o = 0.2, f3\ = 0.1, a\ = 1.0 and 02 = 4.0. The results based on 
1,000 simulations are reported in Table 4. It is seen that, once again, there 
is not much difference in terms of the simulated means. However, the sim- 
ulated variances of the one-step estimator are about 31% and 28% larger 
than those of the IRLS estimator, which is the same as the MLE in this 
case, for the estimation of flo and [3\ , respectively. Of course, both the one- 
step and IRLS estimators significantly outperformed the OLS estimator in 
terms of the simulated variance. It is remarkable that, although Theorem 1 
(or Corollary 1) only ensures convergence with probability tending to one 
in large samples, with a fairly small sample size of n = 10, every one of our 
1,000 IRLS runs actually converged. Here the criterion for convergence is 
similar to that in the previous subsection. The number of steps it took to 
converge ranged from 3 to 64 steps, and so are not reported here. 

7. A numerical example. In this section we consider a data set presented 
by Hand and Crowder [5] regarding hip replacements of 30 patients. Each 
patient was measured four times, once before the operation and three times 
after, for hematocrit, TPP, vitamin E, vitamin A, urinary zinc, plasma zinc, 
hydroxyprolene (in milligrams), hydroxyprolene (index), ascorbic acid, caro- 
tine, calcium and plasma phosphate (twelve variables). One important fea- 
ture of the data is that there is a considerable amount of missing observa- 
tions. In fact, most of the patients have at least one missing observation for 
all twelve measured variables. In other words, the longitudinal data set is 
(seriously) unbalanced. 

We consider the measured variable: hematocrit. This variable was con- 
sidered by Hand and Crowder [5], who used the data to assess age, sex and 
time differences. The authors assumed an equicorrelated model and obtained 
Gaussian estimates of regression coefficients and variance components (i.e., 
MLE under normality). Here we take a robust approach without assuming 
a specific covariance structure. The covariates consist of the same variables 



Table 4 

Comparison with the one-step estimator 







Estimation of /3 






Estimation of /3 1 






OLS 


One-step 


IRLS 


OLS 


One-step 


IRLS 


Simulated mean 


0.15 


0.17 


0.18 


0.16 


0.16 


0.15 


Simulated variance 


31.40 


12.70 


9.73 


60.71 


24.76 


19.35 
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as suggested by Hand and Crowder. They are an intercept, sex, occasion 
(three), sex by occasion interaction (three), age and age by sex interaction. 
For the hematocrit data the IRLS algorithm converged in seven steps. The 
results are shown in Table 5. The Gaussian (point) estimates of Hand and 
Crowder [5], page 106, are also included for comparison. 

It is seen that the IRLS estimates are similar to the Gaussian estimates, 
especially for the parameters that are found significant. This is, of course, 
not surprising, because the Gaussian estimator and IRLS should both be 
close to the BLUE, provided that the covariance model suggested by Hand 
and Crowder is correct (the authors believed that their method was valid in 
this case). Taking into account the estimated standard errors, we found the 
coefficients 0i, 03, 04, 05 and (3q to be significant and the rest of the coeffi- 
cients to be insignificant, where 0i,02> ■ • ■ are the coefficients corresponding 
to the variables described in the second paragraph of this section in that 
order. These are consistent with the findings of Hand and Crowder, with 
the only exception being (3q. Hand and Crowder considered jointly testing 
the hypothesis that 06 = 07 = (3g = and found an insignificant result. In 
our case, the coefficients are considered separately, and we found 07 and 08 
to be insignificant and 06 to be barely significant at 5% level. However, since 
Hand and Crowder did not publish the individual standard errors, this does 
not necessarily imply a difference. 

8. Discussion and concluding remarks. In analysis of longitudinal data 
under linear mixed models, maximum likelihood (ML) and restricted max- 
imum likelihood (REML) are well-established methods which apply to lin- 
ear mixed models in general. What is the advantage of IRLS over these 
methods? First, the ML and REML methods require (i) normality; and (ii) 
correct specification of the parametric covariance structure. We do not be- 
lieve that (i) is very important to the estimation of 0, because the ML and 
REML estimators are consistent even without normality (Richardson and 



Table 5 
Estimates for hematocrit 



Coef. 


0i 


02 


03 


04 


05 


06 


07 


08 


09 


010 


IRLS 
s.e. 


3.19 
0.39 


0.08 
0.14 


0.65 
0.06 


-0.34 
0.06 


-0.21 
0.07 


0.12 
0.06 


-0.051 
0.061 


-0.051 
0.066 


0.033 
0.058 


-0.001 
0.021 



Gaussian 3.28 0.21 0.65 -0.34 -0.21 0.12 -0.050 -0.048 0.019 -0.020 

The first row gives IRLS estimates corresponding to, from left to right, intercept, sex, 
occasions (three), sex by occasion interaction (three), age and age by sex interaction; the 
second row gives estimated standard errors corresponding to the IRLS estimates; the third 
row gives the Gaussian maximum likelihood estimates that were obtained by Hand and 
Crowder [5]. 
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Welsh [15], Jiang [9]). However, (ii) is crucially important. In fact, if (ii) 
fails, the ML and REML estimators may lose efficiency, even consistency 
(e.g., Liang and Zeger [12], White [17]). Note that, if the covariances are 
estimated using inconsistent estimators, the resulting "BLUE" is no longer 
BLUE, even asymptotically. In contrast, IRLS is not normality-based and 
does not require a parametric covariance structure. In this paper, we have 
shown that the IRLS estimator is asymptotically as efficient as the BLUE 
regardless of the covariance structure (and normality). This is confirmed 
by our simulation results. Second, IRLS is computationally easier to im- 
plement. The ML and REML methods require maximization of a nonlinear 
function or, at least, solution of a system of nonlinear equations. Although 
standard numerical procedures are available, problems and difficulties are 
often encountered in practice. For example, the Newton-Raphson procedure 
is known to be inefficient when the dimension of the solution is high, and its 
convergence is heavily affected by the choice of starting values; the EM algo- 
rithm is known to converge slowly. On the other hand, each step of IRLS is 
defined by closed-form expressions, therefore can be calculated analytically; 
the convergence is very fast, as we have demonstrated, and one does not 
need to worry about starting values. Note that because of the fast conver- 
gence of IRLS, in practice one may not need a lot of iterations (see the last 
paragraph of Section 4). This is also confirmed by our simulations. By the 
way, we have encountered difficulties in our simulations for computing the 
ML or quasi-likelihood estimator using the standard Newton-Raphson pro- 
cedure. In fact, in some of our simulations, nearly thirty percent of the ML 
runs failed to converge. In contrast, every one of our IRLS runs converged 
and converged quickly, sometimes in two or three steps. Furthermore, the 
asymptotic properties of IEE, namely, Theorems 2 and 3, make sure that 
ii* always converges (in probability) to the true v, and hence results in 
an asymptotically optimal estimator of (3. Other methods such as ML or 
REML do not necessarily ensure the latter property in case of misspecifica- 
tion of the variance-covariance structure. Therefore, IEE presents a robust 
and asymptotically efficient estimation procedure. 

A basic assumption in this paper is Assumption Al, which essentially 
says that the number of different covariances between the observations is 
bounded. Although the assumption is required for establishing the theoret- 
ical properties of IEE, it does not mean that IEE cannot operate without 
such an assumption — it just makes it more difficult to justify. Alterna- 
tively, if the number of different covariances increases with the sample size, 
a parametric model for the covariance structure may be assumed to reduce 
the number of covariance parameters. Although, as mentioned earlier, such 
a model may be more sensitive to model misspecifications, it seems to be a 
reasonable approach in this difficult situation. For example, one may think of 
a linear model for the covariance components with factor covariates. Then, 
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under Assumption Al, the number of columns of the design matrix in such 
a linear model is finite. More generally, one may incorporate continuous co- 
variates in the linear model, which may be useful in some cases where the 
covariances of the observations depend on continuous covariates. Note that, 
however, the explicit expressions of the estimates of the covariance com- 
ponents that we derived in this paper may not exist under a parametric 
model. In fact, even under a linear model the estimates of the covariance 
components are subject to constraints, such as nonnegativity for the variance 
components, which are not guaranteed for the least-squares type estimates. 
Nevertheless, it is possible to develop a similar iterative procedure under a 
parametric covariance model. 

IEE is not very picky about the initial estimator. For example, the starting 
point of IRLS is the OLS estimator; however, this is not an essential part 
of the algorithm. In fact, the only properties of OLS that are used are that 
(i) it is consistent, and (ii) W = I is bounded from above and below (i.e., 
|| W|| and IIVF -1 )! are bounded). Therefore, the starting point of IRLS may be 
replaced by another WLS estimator such that W is bounded from above and 
below. A further question is whether one actually has global convergence, 
meaning that the algorithm converges regardless of the initial estimator (e.g., 
Luenberger [13]). It is an interesting question for which we do not have an 
answer at this point. 

In this paper, we not only have proved that the IEE procedure converges 
under very mild conditions, we have shown that it converges linearly, that 
is, at an exponential rate. This is the main theoretical finding of this paper. 
Although the convergence of IEE would have been expected, the issue has 
never been rigorously addressed, especially regarding its convergence rate. 
Furthermore, the limiting IEE estimator is consistent and asymptotically ef- 
ficient. These theoretical results are confirmed by our simulation studies. In 
addition, our method leads to consistent estimators of the covariances with- 
out assuming a parametric covariance structure and/or normality. Finally, 
we extended the robust estimation procedure (see Section 1) to unbalanced 
cases, which should make it a more attractive method for the analysis of 
longitudinal data. 

9. Proof of Theorem 1. Notation. Throughout the rest, f3, v, and so 
forth represent the true (3, v , and so forth when there is no confusion. Given 
v, the corresponding (5 is understood as (3 = f3(v); similarly, given /3, the 
corresponding v is v = v(f3) (see Section 4.1). Note that the two equations 
j3 = f3(v) and v = v{f3) need not hold simultaneously (unless at convergence). 
Therefore, for example, vo(3(v) may not equal v. Also, we use notation such 
as (3^> for (5{v^}, and so forth, when there is no confusion. 

The (spectral) norm of a matrix A is defined as \\A\\ = {A max (^4'^4)} 1 / 2 , 
and the 2-norm is defined as 1 1 ^4 1 1 2 = {tr(A'A)} 1 ^ 2 . 
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Let So and Mq denote positive numbers such that Sq < 1, Mq > 1 and 
5o < Mq/16. We define the following sets [see notation below (2.2), above 
(3.2) and above Theorem 1]: 

A) = {|/3 (1) -/3|<£ }, 

where £o is a positive constant suitably chosen later on; 

Bi = {A min (V-) > 25 and A max (^) < M /2, 1 < i < n}, 

where Vi = (vij k )j t k£Ji and is defined by the right-hand side of (3.2), if 
Vijk = v{j,k,l); 

B 2 = \n{j,k,l)- 1 ]T {l + Sij )\Y lk -fi lk \<e- 1/2 ,l<l<L jk ,(j,k)eD\; 

£ 3 = {sup||<9/3/cto|| <eo 1/4 j; 
Uev J 

-B4 = a set defined in Lemma 5 below such that P(i?4) — ► 1 as n — > oo; 
B 5 = |n(i,A;,0~ 1 



^ (dnij/df3q)(Y ik - n ik ) 
iei(j,k,t) 



< 1/2 



l<g<p,l<Z<L ife ,(j,fc)e£>j; 

Co = {(l/4)A min (y 4 ) > <5 and 4A max (V r i ) < M , 1 < i < n}; 

C l = {b 2 (2e 1 /2 + L 2 1 e 2 )<5 }; 

C 2 = {y/R(RVp)[Llel /A + (L 2 + l)e\ /A ] < V /2}. 

A rule for the notation is that B represents a set involving both the Y^'s 
and the XiS, while C represents a set that only involves the X^s. Note that 
these are subsets of the probability space on which all the random variables 
are defined. 

For the most part, the proof consists of two major steps. The first is 
to show that the sequences /3( m ) and (m = 1,2,...) are bounded. The 
second is to show that, within such a bounded range, the partial derivatives 
of r(-) and p(-) are bounded by some small numbers. We first state and 
prove some lemmas. Lemma 4 below ensures that, with high probability, (3 
is confined to a neighborhood of (3 implies that v is bounded; while Lemma 
5 ensures the opposite, that is, with high probability, v is bounded implies 
that (3 is confined to the same neighborhood of (3. Then, with Lemma 4 
and Lemma 5, one is running a cycle, which allows one to argue that the 
entire sequence of 

y(m) ^ jtj, = 1, 2, ... , is bounded, starting from the 
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beginning. Given that the sequence is bounded, the next thing is to obtain 
upper bounds for the derivatives, which is what Lemma 6 does. The proofs of 
Lemma 4 and Lemma 5 are fairly straightforward and therefore are omitted. 
Recall that e\ is the arbitrary positive constant involved in the definition of 
Lj, j = 1,2,3 (see Section 4). 

Lemma 4. For any < £q < E\, we have {\$ — 0\ < eq implies A min (Vi) > 
5 Q and A max (Vi) < (9/16)M , l<Kri}DSin5 2 nCi. 

Lemma 5. Under Assumptions A3 and A4, for any Eq > 0, there is a 
set £?4 with P(B^) — ► 1 as n — > oo such that {v € V implies \/3 — (3\ < Eq} D 

B 3 n5 4 . 



Lemma 6. For any < eo < e\, we have 

dr 

\P — (3\ <Eq and v E V implies 



dp 




dv r 


v=v 



<r/,l<r<i? 





d B 2 r\B 3 nB 5 nC 2 . 








Proof. 


We have 








(9.1) 




R 

E 

r=l 


dv r 


dv r 

W q 


(9.2) 


dp 

dv r 


P 

E 

9=1 


dv 

W q 


dv r 



Let v r = v(j,k,l) for some j, k, I. Then, by (3.2), we have 



dv r 1 
W q P=P~ ~ n(j, k, I) 

1 



n(j,k,l) 



E 

iei(j,k,i) 

E 



(dgj 



d ^(X i} P)\{Y lk -g k (X u P)} 



I a/3. 



= -(h + h). 
Furthermore, we have 



1 

n(j,k,l) 



E 

i£l(j,k,l) 



\d(3 q 



{X h l3)\{Y ik - m ) 
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= Al + ^12 + ^13- 

Suppose that |/3 — /3| < £o- Then, it is easy to show that | J13I < L\eq; \Ii 2 \ < 

1 /2 1 /2 A 

I/2£ on B 2 ; and |In| < e on f?5. Thus, on B 2 n -B5, |/3 — /3| < £0 im- 
plies |ii| < L\eq + (L 2 + l)^ 2 , and, similarly, |J 2 | < L\eq + (£2 + l)^ 2 - 
Therefore, by (9.1), we have, on B 2 n B 3 n B 5 n C 2 , that |/3 - /3| < e and 
■u E V imply |(9r/9/3 g )L_3| < ??• The arguments also show that, on B 2 H S5, 

|/3 - /3| < e implies \(dv/dfi q )\ p=$ \ < 2^/R{L\e Q + (L 2 + 1)4 /2 }. Thus, by 

(9.2), we have, on B 2 n B 3 n £5 n C 2 , that |/3 - /3| < e and u G V imply 
|(dp/cky)| u=fi | <?7. □ 

We are now ready for the proof. For any 5 > 0, by Assumption A2, there 
is M > such that P(A) > 1 - 5, where .4 = {Lj < M,j = 1, 2, and ||^|| V 
ll^i" 1 )! < -M, 1 < 2 i < n}. Then, choose 5o and Mq as in the third paragraph 
of this section such that A m i n (Vi) > 4<5o and A max (Vi) < Mo/4, 1 < i < n, on 
A. Furthermore, if £0 > is chosen such that 

e <ei, b\2el /2 + M 2 e 2 Q )<5 , 



R(RVp){M 2 el /A + (M + l)el /4 }<^ 

(e± is defined below Assumption A5), we have Cj D A, j = 0, 1, 2. 

By Assumption A3, there is Nq such that P(-Bq) <S, if n> Nq. 

We have v(j, k, l) — v(j, k, I) = n~ l (j, k,l) J2iei(j,k,l) A i,j,k for any (j,k) G D, 
1 < I < L jk , where A ijifc = (1^- Hij)(Y ik - fi ik )-v ijk . It follows, by Assump- 
tion A2, that E{6(i,M) - «(i,*,0} 2 = n ~ 2 (i^>0E ie 7( i)fc ,0 E ( A Lfe) ^ 
en (j,k,l) for some constant c. It is then easy to show, by Assumption 
A5, that there is Ni such that P(max; \\Vi — Vi\\ > So) < 5, if n > N\. Since 
B\D An {max; \\Vi -Vi\\< 5 }, we have P(Bf) < 25, if n > N v 

Similarly, there is N4 such that P(-B|) < 5, if n > N 4 . 

Now consider i? 2 . It is easy to see that B 2 D AC) G, where 

G A^,M/' i -' HA< - lM+ir%1>2 ' 

l<l<L jk ,(j,k)£D\. 
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By Chebyshev's inequality and Assumptions Al, A2, we have 



P(G C ) < Yl 

i<i<L jk ,(j,k)eD 

< Cl (M + l)el /2 



(m + 1)4 /2 

n(j,k,l) 



J2 E(|^-^|) 



iei(j,k,i) 



1/2 

for some constant c\. Thus, we have P(B%) < 5 + c\(M + 1)£ • 

Also, by Assumption A4, there is £2 > such that P(B^) < 5, if £0 < £2- 
Finally, for any 1 < q<p, (j,k) £ D and 1 < I < Ljk, write 



S, 



1 



n{j,k,l) 



iei(j,k,i) 



> £ 



1/2 



and A(i) = {\dfiij /d[3 q \ < M,Vikk < M}. Then we have ^A(i) = 1 Vi on A. 
Thus, we have 



< 



e n 2 (j,k,l) 



i€l(j,k,l) 



iei(j,k,i) 



> £ 



1/2 



< 



M 3 



£ n(j,k,l) 

Therefore, there is N5 which depends on M and £0, such that P(S q j t k,i^-A) < 
S/h, for all such q, j, k and /, if n > .ZV5, where h is the cardinality of 
the set H = {(q,j,k,l) : 1 < q < p, (j, k) £ D,l < Lj^}, which is bounded by 
Assumption Al. It follows that 

P(BI nA)< ]T P(S qJAl nA)<5, 

(q,j,k,l)eH 

and hence P(B£) < P(£§ n A) + P(-4 C ) < 25, if n > A%. 

In conclusion, for any 6 > 0, choose £0 > such that £0 < £1 A £2 A 

1, b 2 (2el /2 + M 2 £g) < S , ^/R(RVp){M 2 £ 3 /4 + (M + 1)£q /4 } < n/2 and 
1/2 

ci(M + 1)£ < S. Then, choose Nj, j = 0,1,4,5, as above, and let N 
be the maximum among them. It follows that P(Cj) <5, j = 0,1,2, and, 
when n > N, P(B?) <6, j = 0, 3, 4, and P(£?) < 25, j = 1, 2, 5. Thus, with 
£ = B n • • • n B 5 n C n d n C 2 , we have P(£ c ) < 125, if n > N. 

Now consider what happens on £. For any /3W such that — /3| < £0, 
j = 1, 2, we have 

T{/3 (1) } _ r{ ^(2) } = [T?{j g(l) } _ r^W}]^. 



ITERATIVE ESTIMATING EQUATIONS 27 
By Taylor expansion we have 

T q {^-r q {^ = ±^\^-^}, 



where j3 lies between (3^ and (but may depend on q). Note that \(3v) — 
P\ < £o 5 j = 1)2, implies \$ — (5\ < eo- Thus, by Lemma 4, we have v £ V. 
Therefore, by Lemma 6 and the Cauchy-Schwarz inequality, we have 

|r g {/3 (1) }-^ (2) }|<^l/? (1) -/3 (2) l, 

hence 

(9.3) |r{/? {1) } - r{/3( 2 )}| < pi]\(5^ - pW\. 

Similarly, by Lemma 5 and Lemma 6 it can be shown that for any u«J £ V, 
j = 1, 2, we have 

(9.4) \p{v {1) } - p{v {2) }\ < Rv\v {1) - v {2) \. 

Now, on £ we have \(3^ - 0\ < e and flW = £ V; |/3 (2) - /3| = - 
j3\ < eo, hence by Lemma 4, {/( 2 ) £ V; thus, by Lemma 5, \(3^ — f3\ < £oj 
and t/ 3 ) = u< 2 ) £ V; and so on. It follows that - f3\ < e and uM £ V, 

m = 1, 2, 

We now apply (9.3) to obtain, for any a,b> k, 

| / g(2a-l) _ £(26-1)1 = | T |^(2a-3) } _ r {£(26-3) } | 

<(p??)l/5 (2a - 3) -/3 {2fc - 3) l 

(9.5) <••• 

< (p??) fc |/3 (2a " 2fc " 1) - p( 2b ~ 2k - 1 )\ 

It follows that the sequence {[5^ 2a ~ 1 ^ , a = 1, 2, . . .} is a Cauchy sequence and 
hence convergent. Furthermore, since /?(2a) = Z^ 20 " 1 ), the entire sequence 
{$^ m \ m = 1,2,...} converges. By letting b — > oo and k = a — 1 in (9.5), it is 
easy to show that 

< 2 £o _^ m=i2 
(prj) m / 2 ~ prj 

Similarly, by (9.4), it can be shown that - \ < (Rr ] ) k \v ( - 2a ~ 2 ^ - 
-(26-2fc)| <- (2y r RMo)(i?r?) fc for any a,b>k. Thus, by similar arguments, the 
sequence {ti( m ) , m = 1, 2, . . .} converges, and we have 



(Rrj) m / 2 



<2M jR\Zr]~ 1 = M 2 , m = l,2,.. 
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Let 



sup{|/3( m ) 



/3*i/(p??r /2 } 



m>l 



sup{|-0 (m) 



v*\/(Rr,) m ' 2 }. 



m>l 



We have shown that for any 5 > 0, there is N > 1 such that, when n> N, 
P(IEE converges) > P(£) > 1 - 125 and P(Cj < oo) > P(Q < Mj) > P(£) > 
1 — 125, j = 1, 2. This completes the proof. 
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